In large scale inverted file information retrieval systems implemented on
conventional digital computers, it is possible for more time to be spent
processing and merging the index lists than on any other activity. However,
this non-numeric processing cannot be performed efficiently by general purpose
digital computers, which degrades either the response time of the system or
the number of simultaneous users it can support. This paper describes a simple
processor which can efficiently process these lists while the main computer
devotes itself to other tasks.

It is also possible to combine these merge processors into networks which
can process complex expressions directly. This eliminates the need for
memory cycles to store and later refetch intermediate results,
effectively increasing the available memory bandwidth.

Designs for both word-parallel and bit-serial implementations of the
merge processor are presented. Also discussed are the modifications necessary
for these processors to be connected as a network, and in particular to form a
binary tree, along with algorithms for parsing both expressions which fit in
the available tree and expressions which are too large and must be subdivided.

"But you can't look up all those license numbers
in  time,"  Drake  objected. 

"We  don't  have  to,  Paul.   We  merely  arrange  a  list 
and  look  for  duplications." 


-Perry  Mason 
(The  Case  of  the  Angry  Mourner,  1951) 




TABLE OF CONTENTS

CHAPTER 1 -- INTRODUCTION TO THE PROBLEM

1.1  Information Retrieval Queries
1.2  Database Structure
1.3  Zipf's Law and the Level of Inversion
1.4  Processing on Conventional Computers

CHAPTER 2 -- A SPECIALIZED LIST MERGING PROCESSOR

2.1  Previous List Merging Processors
2.2  A Simple List Merging System
2.3  Parallel Element Implementation
2.4  Serial Element Implementation

CHAPTER 3 -- MERGE PROCESSOR NETWORKS

3.1  Network Bandwidth Considerations
3.2  Merge Network Hardware Considerations
3.3  A Serious Problem
3.4  Parsing Expressions for a Fixed Tree Size

CHAPTER 4 -- FUTURE RESEARCH

4.1  Higher Bandwidth Memories
4.2  Multi-User Systems

CONCLUSIONS

LIST OF REFERENCES

VITA

LIST OF FIGURES

1.1   Context Hierarchies
1.2   Logical Database Organization
1.3   Table of Word Frequency in "Ulysses"
1.4   Zipf Curve for "Ulysses"
1.5   Zipf Curves for State Statutes
1.6   Truncated Zipf Curve
1.7   Savings Using Complementary Lists
1.8   AND Subroutine from EUREKA System
1.9   List Entry Format for AND Subroutine
1.10  Operations Table
2.1   Stellhorn/Batcher Merge System
2.2   HARVEST Functional Units
2.3   Merge System Configuration
2.4   Complex Merge System Configuration
2.5   Merge Element Block Diagram
2.6   Ripple Carry Number Comparator
2.7   Parallel Count Field Generator
2.8   Serial Merge Element Block Diagram
2.9   Basic Serial Merge Element
2.10  Complete Serial Merge Element
2.11  Serial Element Control Flow Chart
2.12  Element Performance
3.1   Network to Process a Complex Expression
3.2   Binary Tree Network Configuration
3.3   Network with PASS Element
3.4   Network Gate Counts
3.5   Network Deadlock
3.6   Modified Operations Table
3.7   Modified Network Operation
3.8   Multi-Pass Expression Processing
3.9   Input Pair Classes
4.1   Simulation Results


CHAPTER  1  --  INTRODUCTION  TO  THE  PROBLEM 


This  thesis  deals  with  the  design  and  use  of  a  specialized  data 
processing  system  in  connection  with  a  conventional  digital  computer  running  a 
large  scale  information  retrieval  system  using  an  inverted  file  database 
structure.  It  describes  the  operation  of  a  representative  information 
retrieval  system  implemented  on  a  standard  digital  computer,  with  emphasis  on 
the  structure  of  the  database  and  the  types  of  queries  generally  made  of  it. 
Then  it  will  be  shown  that  the  processing  of  standard  queries  frequently 
requires  the  merging  of  two  or  more  ordered  lists  of  index  term  pointers.  The 
design  of  a  simple  processor  for  aiding  in  this  and  a  technique  for  increasing 
the  effective  memory  bandwidth  available  by  connecting  a  number  of  these  merge 
processors in a network will be presented. The third chapter concludes with
the implications, for both the hardware design and the methods used to parse
expressions, of arranging the network as a fixed binary tree. Finally, a
number of areas for future research are proposed, with preliminary results in
these areas given.

This  thesis  will  not  try  to  justify  the  use  of  a  particular  file 
structure, data format, or form of query. These problems have been discussed
at great length in the past, and to do so here would only obscure the issue
of a specialized merge processor. Suffice it to say that inverted file
information  retrieval  systems  exist,  that  large  scale  systems  of  this  type 
show  a  decrease  in  performance  due  to  the  disproportionate  time  spent  handling 
the  merging  and  correlation  of  the  index  terms,  and  that  a  specialized 
processor  presents  a  possible  solution  to  the  problem. 


1.1  Information Retrieval Queries

A  representative  information  retrieval  system  can  be  considered  to  have  a 
command  which  locates  items*  within  the  database  which  match  a  given  pattern. 
Any  additional  commands,  such  as  those  which  print  out  the  results  of  the 
search,  create  sets  of  items  based  on  previous  searches,  and  the  like,  are  not 
important  to  the  discussion  of  list  merging.  This  command  can  find  all  items 
which  contain  one  or  more  of  a  set  of  explicitly  specified  character  strings 
(with  the  command  in  the  form  "FIND  'AARDVARK'"  to  locate  all  occurrences  of 
AARDVARK,  or  "FIND  'AARDVARK'  OR  'AARDWOLF'"  to  find  all  items  which  contain 
either AARDVARK or AARDWOLF). This union of terms can also be specified
implicitly by the use of "explosion" or "wildcard" functions. For example,
EXPLODE('HOUND') may produce an operation equivalent to 'BASSET HOUND' OR
'BEAGLE' OR ... . Similarly, PROGRAM* (where * is assumed to match any
arbitrary character string, including the null string) may be equivalent to
'PROGRAM' OR 'PROGRAMMED' OR 'PROGRAMMER' OR ..., with as many terms as
there are words in the database which begin with PROGRAM.
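
As a rough illustration of how a trailing wildcard turns into an implicit OR, the sketch below expands a prefix against a sorted vocabulary. The vocabulary, the function name `expand_prefix`, and the sorted-list representation are assumptions made for the example, not details of EUREKA.

```python
from bisect import bisect_left

def expand_prefix(vocabulary, prefix):
    """Return every term in a sorted vocabulary that starts with prefix."""
    start = bisect_left(vocabulary, prefix)
    terms = []
    for term in vocabulary[start:]:
        if not term.startswith(prefix):
            break                      # sorted order: no later term can match
        terms.append(term)
    return terms

# Hypothetical vocabulary, kept in sorted order.
VOCAB = sorted(["PROGRAM", "PROGRAMMED", "PROGRAMMER", "PROGRESS", "PROJECT"])
query = " OR ".join(f"'{t}'" for t in expand_prefix(VOCAB, "PROGRAM"))
print(query)
```

The expanded query has one term per matching vocabulary word, so a short prefix over a large database can produce a very long OR expression.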

In  addition,  the  command  allows  searching  for  two  or  more  terms  occurring 
within  a  given  context.  While  the  meaning  of  context  varies  from  system  to 
system depending upon the inherent hierarchy of the data stored, the contexts
given  in  Figure  1.1  for  EUREKA  will  be  assumed.  An  example  of  this  second 
form  of  the  command  is  "FIND  'BEOWULF'   AND   'GRENDEL'   IN  PARAGRAPH",   which 


*  An  item  is  the  primary  entity  in  the  representative  information  retrieval 
system.  When  inverted  files  are  present,  it  represents  the  context  to 
which  the  inversion  was  made,  and  which  must  be  searched  if  a  lower  context 
is  specified.   For  EUREKA  [1],  an  item  corresponds  to  a  document. 

Figure 1.1 - Context Hierarchies


matches  all  items  which  have  a  paragraph  containing  both  BEOWULF  and  GRENDEL. 
If  no  context  is  specified,  it  is  assumed  that  the  request  is  for  the 
co-occurrence  of  the  terms  in  the  same  item.  A  special  form  of  the  AND 
connective  exists  in  the  form  "FIND  'AREA  NAVIGATION'".  This  expression  asks 
the  system  to  find  a  sentence  which  contains  both  AREA  and  NAVIGATION, 
occurring  adjacently  and  in  that  order. 

A third form of an expression is "FIND 'COMPUTER' AND NOT 'ANALOG'" which
finds all items which contain the word COMPUTER but not the word ANALOG. It
is  easy  to  see  that  the  OR  connective  increases  the  number  of  items  meeting 
the  test,  while  the  two  forms  of  the  AND  decrease  the  number.  This  is 
important,  since  the  primary  function  of  the  information  retrieval  system  is 
to  reduce  the  number  of  items  which  must  be  examined  for  relevancy. 

These  three  forms  can  also  be  combined  to  make  a  more  complex  expression, 
for  example: 

    FIND EXPLODE('RNAV') OR 'OMEGA' AND 'DIGITAL COMPUTER' IN
    SENTENCE OR 'PROGRAM*' AND NOT 'INERTIAL'
    AND 'NAVIGATION' IN PARAGRAPH

1.2  Database  Structure 

Probably  the  easiest  database  organization,  both  conceptually  and  in 
terms  of  implementation,  is  one  consisting  of  all  the  textual  material 
organized  in  a  single  file  which  is  searched  sequentially  in  response  to  a 
user's  query.  The  programming  necessary  to  implement  the  previously  discussed 
connectives  is  fairly  obvious.  The  only  disadvantage  to  this  type  of 
organization is that the time required for each search on a conventional
digital computer increases linearly with the number of characters (or items)
stored. For a batch processing system this may not be a serious problem,
since many users' requests can be satisfied in a single pass through the
database. However, for a system like the National Library of Medicine's
MEDLINE [2], with 500,000 documents available for online user inquiries, the
batching of requests would be impractical, while the initiation of a search
through the entire database for each user request is impossible if adequate
response times for a large number of users are to be achieved.

The  answer  to  this  problem  is  to  provide  an  index  to  the  material  which 
can  be  checked  to  eliminate  the  needless  searching  of  items  which  do  not 
contain  the  desired  terms.  This  index  can  be  prepared  manually  by  trained 
indexers,  or  automatically  by  a  computer.  In  the  latter  case,  all  words  can 
be  indexed,  only  certain  words  from  a  predefined  list  can  be  indexed,  or  all 
words  except  those  on  an  exception  list  (such  as  THE,  AND,  A,  etc.)  can  be 
indexed. 

Figure  1.2  illustrates  the  logical  organization  of  the  database  in  an 
inverted  file  structure.  The  index  file  contains  lists  of  pointers  to  items 
in  the  text  file,  and  other  data  necessary  for  the  operation  of  the  system. 
Additionally,  the  list  entries  may  contain  context  flags,  to  indicate  that  the 
word  is  also  contained  in  some  context  outside  of  the  body  of  the  item  (such 
as  the  title),  and  other  data,  such  as  a  count  field  to  indicate  the  frequency 
of  occurrence  of  the  word  in  the  item. 

Figure 1.2 - Logical Database Organization

As long as the expressions imply the item level for a context, the
requested operations can be performed without actually searching the textual
material. For an OR operation, all that is required is to form the union of
two or more index lists; for AND, the intersection; and for AND NOT, the
removal of entries contained in the second argument's list from the list
specified by the first. For requests specifying a lower level of context,
such  as  the  sentence,  the  result  of  the  merging  of  the  index  lists  gives  a  set 
of  items  which  have  a  chance  of  success  in  a  full-text  search.  This  saves 
processor  time  by  not  requiring  the  searching  of  items  which  cannot  possibly 
match  the  specified  search  expression. 
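
The three list operations just described can be sketched as a single pass over two sorted lists of item numbers. This is only an illustrative sketch: the function name `merge_lists` and the use of plain integers as index entries are assumptions, and real entries carry additional fields.

```python
def merge_lists(a, b, op):
    """Combine two ascending lists of item numbers with OR, AND, or AND NOT."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:               # item appears in both lists
            if op in ("OR", "AND"):
                out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:              # item appears only in the first list
            if op in ("OR", "AND NOT"):
                out.append(a[i])
            i += 1
        else:                          # item appears only in the second list
            if op == "OR":
                out.append(b[j])
            j += 1
    # One list is exhausted; whatever remains appears in one list only.
    if op in ("OR", "AND NOT"):
        out.extend(a[i:])
    if op == "OR":
        out.extend(b[j:])
    return out
```

Because both inputs are ordered, each entry is examined once and the output is produced already in order, which is what makes the operation a merge rather than a search.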

1.3  Zipf's  Law  and  the  Level  of  Inversion 

Since  the  storage  requirements  for  an  inverted  file  system  depend  upon 
not only the total number of words in the database (the number of tokens), but
also the number of distinct words (the types) and the number of times each
type occurs (the number of tokens per type), estimating the requirements is
more  difficult  than  with  a  full  text  searching  system.  However,  it  has  been 
observed  for  a  number  of  natural  phenomena,  including  large  collections  of 
text,  that  when  the  types  are  ranked  according  to  their  number  of  tokens,  the 
product  of  the  rank  of  the  type  and  the  number  of  its  tokens  is  constant. 
This  is  referred  to  as  Zipf's  Law*.  When  a  constant  product  is  plotted  with 
the  rank  of  the  type  on  the  x-axis,  and  the  number  of  tokens  in  the  type  on 
the  y-axis,  the  graph  is  in  the  form  of  a  rectangular  hyperbola.  If  this  same 
curve  is  plotted  on  log-log  coordinates,  the  result  will  be  a  straight  line. 
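
In symbols (the names $r$, $f(r)$, and $C$ are introduced here for illustration and do not appear in the original), the observation and its log-log form can be written as:

```latex
\[
  r \cdot f(r) = C
  \quad\Longrightarrow\quad
  \log f(r) = \log C - \log r ,
\]
```

where $r$ is the rank of a type, $f(r)$ is its number of tokens, and $C$ is a constant. The second form is a straight line with slope $-1$ on log-log coordinates.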


*  After  George  Kingsley  Zipf,  a  professor  of  German  at  Harvard  University. 
Zipf  initially  studied  the  distribution  of  words  in  a  variety  of  languages 
and  from  a  number  of  sources,  observing  approximately  the  same  results.  He 
regarded  these  results  as  some  sort  of  universal  truth,  which  he  called  the 
Principle of Least Effort, and attempted to use it to explain the Civil War,
Committees of Congress, chamber music, the Chicago Tribune, and sex. [3,4]


Figure 1.3 represents data collected by Dr. Miles Hanley and Dr. M.
Joos on the distribution of words in James Joyce's novel "Ulysses" [5]. This
work was initially selected to demonstrate that the Zipf distribution would
not exist in a sample of this size (260,430 tokens). It can be seen from the
product column in the table and from the graph of the data on log-log
coordinates in Figure 1.4 that even a sample of this size confirms Zipf's
observations. The abnormality at the right side of the curve exists because a
word can only occur an integral number of times; the final step, for example,
is between words which occur twice and those which occur once.

Figure 1.5 shows the curves for a large database consisting of the
statutes of a state. The curve for inversion to the word level corresponds to
the actual Zipf curve, and approximates a straight line when plotted log-log.
The other two curves indicate that as the inversion is made to a higher level,
the curve flattens on the left side, as a result of a greater number of words
which appear in all or nearly all items at the level to which the inversion
was made. At the sentence level, about a dozen words (such as A and THE)
occur in at least half the items, while at the document level, with a document
corresponding to a chapter of the statutes, more than 300 words occur in over
half the items.

Figure 1.3 - Table of Word Frequency in "Ulysses"

Figure 1.4 - Zipf Curve for "Ulysses"

It is customary in many information retrieval systems to delete common
words from the index file, both to conserve space and to prevent the user from
making a query which would result in an inordinate number of items matching.
However, this imposes a rather arbitrary restriction on the system user, since
it is possible for him to meaningfully use these words in a query. This query
would generally take the form of A AND B or A AND NOT B, where A is a list
which has a small number of entries, and B is a list with a large number.
Both have as results a list with the number of entries less than or equal to
the number of entries in A, while the result of A OR B is of the same order
as B (the large list). However, the first two expressions can be rewritten as
A AND NOT (NOT B) and A AND (NOT B), respectively. Therefore, a list
indicating the items which do not contain a term (the list for NOT B) can be
stored in place of the list for B if the term is contained in more than half
the total number of items.
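
A small sketch of the complemented-list idea follows. It uses Python sets in place of ordered index lists, and the names `store`, `and_`, and `and_not` are hypothetical; the point is only that both query forms can be answered from whichever of B or NOT B is the shorter list.

```python
def store(postings, all_items):
    """Keep the shorter of a term's list and its complement.

    Returns (complemented, items), where complemented is True when the
    stored set lists the items that do NOT contain the term.
    """
    if len(postings) > len(all_items) / 2:
        return True, all_items - postings
    return False, set(postings)

def and_(a, stored_b):
    """A AND B, using A AND NOT (NOT B) when B is stored complemented."""
    complemented, items = stored_b
    return a - items if complemented else a & items

def and_not(a, stored_b):
    """A AND NOT B, using A AND (NOT B) when B is stored complemented."""
    complemented, items = stored_b
    return a & items if complemented else a - items

ALL = set(range(10))                   # hypothetical database of ten items
b = {0, 1, 2, 3, 4, 5, 6, 8}           # a common term: in 8 of the 10 items
a = {2, 7, 8}                          # a rare term
stored = store(b, ALL)                 # stored as the complement of b
```

In this example the common term needs two stored entries instead of eight, and neither query form ever has to reconstruct the full list for B.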

The  saving  achieved  by  using  this  technique  depends  upon  the  frequency  of 
occurrence  of  the  words  and  the  total  number  of  tokens  at  the  inversion  level 
selected.  A  simple  model  which  can  be  used  consists  of  a  line  with  slope  -1 
for  the  Zipf  curve  of  the  database  inverted  to  the  word  level,  as  illustrated 
in  Figure  1.6.  When  the  inversion  is  made  to  a  higher  level,  all  points  on 
the  curve  with  a  frequency  greater  than  the  total  number  of  items  at  the 
inversion level are made equal to the total number of items. In reality, the
changing of the inversion level not only truncates the left hand portion of
the curve, but also decreases the slope, moving the point where the
negatively-sloped line meets the horizontal line to the left.

Figure 1.6 - Truncated Zipf Curve

Figure 1.7 - Savings Using Complementary Lists

Figure 1.7 illustrates the savings achieved by using complemented lists.
It must be remembered that the results are plotted on log-log coordinates, so
the relative sizes of the areas are misleading. However, it does illustrate
where the saving is achieved. With the simple model for a higher level
inversion, using complementary lists results in a saving of 15 to 25 percent
in the size of the index file. On actual data (the statutes mentioned
previously) the decrease in size is 15 percent at the document level and 10
percent at the sentence level, due to differences between the simple model's
curve and the actual curve.
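
The simple model lends itself to a short calculation. The sketch below is one possible reading of it, not the computation used in the thesis: list lengths follow a slope -1 Zipf curve, are truncated at the number of items, and are then replaced by their complement when that is shorter. The function and parameter names are assumptions, and the result depends entirely on the values supplied.

```python
def index_saving(c, n_items, n_types):
    """Fractional reduction in index entries under the truncated-Zipf model.

    c        -- the constant rank * frequency product at the word level
    n_items  -- number of items at the chosen inversion level
    n_types  -- number of distinct words (ranks 1 .. n_types)
    """
    plain = complemented = 0.0
    for rank in range(1, n_types + 1):
        length = min(c / rank, n_items)          # truncate at the item count
        plain += length
        complemented += min(length, n_items - length)
    return 1.0 - complemented / plain
```

In this idealized model a word that occurs in every item needs no entries at all once it is complemented, which is one reason the model and the measured data can differ.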

Another  consideration  is  the  extra  processing  required  to  check  contexts 
either  above  or  below  the  item  level  [6].  Earlier,  it  was  noted  that  if  the 
context  specified  is  lower  than  the  item  level,  full-text  searching  may  be 
required  to  determine  if  a  match  actually  exists.  If  the  context  is  higher 
than  the  item  level,  then  a  change  in  the  method  used  for  combining  terms  is 
required.  This  can  be  done  either  by  having  the  item  pointer  contain  encoded 
information regarding the higher level structure (i.e., for inversion at the
sentence level, have the item number consist of fields indicating the document
and paragraph numbers which contain the sentence, as well as the number of the
sentence within the paragraph), or by defining each higher entry as any item
within a fixed range of another item. The first technique is inefficient in
its use of bits in the item pointer, while the second is inaccurate due to the
use of a non-standard definition for the higher levels.
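
The first technique can be pictured as packing the hierarchy into the item pointer. The 8-bit field widths and the function names below are assumptions chosen only for illustration; the text does not specify them.

```python
PARA_BITS = 8      # assumed width of the paragraph-number field
SENT_BITS = 8      # assumed width of the sentence-number field

def pack(document, paragraph, sentence):
    """Encode (document, paragraph, sentence) in one integer item pointer."""
    assert paragraph < (1 << PARA_BITS) and sentence < (1 << SENT_BITS)
    return (document << (PARA_BITS + SENT_BITS)) | (paragraph << SENT_BITS) | sentence

def same_paragraph(p, q):
    """Two sentence pointers share a paragraph if all higher fields match."""
    return (p >> SENT_BITS) == (q >> SENT_BITS)

def same_document(p, q):
    """Two sentence pointers share a document if the document fields match."""
    return (p >> (PARA_BITS + SENT_BITS)) == (q >> (PARA_BITS + SENT_BITS))
```

With the document number in the most significant field, pointers still sort in document order, but every paragraph reserves a full range of sentence numbers whether or not it uses them, which is the inefficiency noted above.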

However,  in  many  systems  operating  on  data  with  an  inherent  hierarchical 
structure  (such  as  Figure  1.1  shows  for  state  statutes  in  a  legal  database), 
it  is  possible  to  invert  to  an  optimal  level  which  minimizes  the  number  of 
context  requests  either  above  or  below  the  item  level. 

1.4  Processing  on  Conventional  Computers 

Most  currently  implemented  inverted  file  information  retrieval  systems 
run  on  standard  digital  computers  (for  example  MEDLINE  on  an  IBM  System/370 
and  EUREKA  on  a  PDP-11/40).  Estimates  made  by  the  operators  of  these  systems 
indicate that a majority of the processor time is spent in the routines used
for fetching and merging posting lists.

Figure  1.8  is  representative  of  the  instructions  used  in  the  EUREKA 
system  to  produce  the  AND  of  two  lists  both  contained  in  memory.  The  program 
has  been  written  to  take  advantage  of  the  high  speed  registers  available  on 
the  PDP-11  computer,  and  is  close  to  the  maximum  efficiency  for  doing  the 
operations  required.  Figure  1.9  illustrates  the  data  element  on  which  the 
program  operates.  The  first  part  of  the  program  checks  the  context  bits  to 
determine  if  the  entry  occurs  in  the  proper  context,  and  if  not,  fetches  the 
next  entry  on  the  list.  When  a  valid  entry  from  each  list  has  been  found,  the 
two  document  number  fields  are  compared,  and  if  they  are  not  equal,  a  new 
entry is fetched from the list containing the lower document number. If they
are equal, an output entry is created consisting of a document number from
either of the two input entries, a count field equal to the minimum of the two
input count fields, and a tag bit equal to the OR of the two input tags; the
context bits are cleared. The tag bit is used as an
indication that full-text searching is required for an entry, since if one of
the input entries required full-text searching to determine if it is a valid
entry, then any entry formed from an AND operation with that entry will
require full-text searching. Figure 1.10 summarizes the operations used to
generate the fields for all three operators.
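
The behaviour described for the AND subroutine can be paraphrased in Python. This is a sketch of the logic only, not a transcription of the PDP-11 code in Figure 1.8: the `Entry` type, the `ctx_mask` parameter, and the rule that an entry is in the proper context when any masked context bit is set are assumptions made for the example.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    doc: int      # document number
    cnt: int      # occurrences in the body of the document
    tag: bool     # True if a full-text search is still required
    ctx: int      # context flag bits

def and_lists(a, b, ctx_mask=0):
    """AND two lists of entries sorted by document number."""
    def valid(e):
        # Assumed context test: no mask, or at least one masked bit set.
        return ctx_mask == 0 or (e.ctx & ctx_mask) != 0

    a = [e for e in a if valid(e)]     # skip entries in the wrong context
    b = [e for e in b if valid(e)]
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i].doc < b[j].doc:
            i += 1                     # advance the list with the lower number
        elif a[i].doc > b[j].doc:
            j += 1
        else:
            out.append(Entry(doc=a[i].doc,
                             cnt=min(a[i].cnt, b[j].cnt),
                             tag=a[i].tag or b[j].tag,
                             ctx=0))   # context bits are cleared
            i += 1
            j += 1
    return out
```

The sketch filters on context before merging, whereas the routine described above interleaves the context check with fetching; the resulting list is the same.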

[Figure 1.9 shows the two-word list entry: the even word holds Context-I,
the tag bit T, and Document Number; the odd word holds Context-II and Count.]

Context-I and Context-II form a group of 13 flag bits [Ctx(A)]
which indicate the contexts within which the term
occurs. The assignment is arbitrary, but must match
the assignment used for the context check mask. In
general, it indicates contexts other than Body, Paragraph,
or Sentence.

Document Number [Doc(A)] is a pointer or indicator of the
document which contains the term. Due to implementation
restrictions, it only allows 4094 documents in the
database.

Count [Cnt(A)] indicates how many times the term occurs within
the body of the document. Zero means it is only in
another context (Title, Author, etc.) and 63 means it
occurs 63 or more times.

T is a special tag bit [Tag(A)] that indicates a result of
the merge operations must be full-text searched to
determine if it actually satisfies the specified
conditions of the query.

Figure 1.9 - List Entry Format for AND Subroutine

The number to the right of the comment field for each instruction is its
number of memory references, which on the PDP-11 is directly proportional to
the time required to perform that instruction. Assuming that an average merge
operation will require the reading of one entry and the writing of one entry
(it can read either one or two and write zero or one), and counting the number
of cycles required on the average to carry out this operation, the average
merge time is about 30 microseconds, with an effective bandwidth of just over
2 megahertz. The actual bandwidth of the memory (16 bits per 650 nanoseconds)
is 24.6 MHz. Memory efficiency of a process can be defined as the effective
bandwidth of the process divided by the available memory bandwidth. For the
previous example, this is about 8.8%. This number is similar to those
calculated for other general purpose digital computers; for example the IBM
System/360 Model 75 has an efficiency of 6.4%, due to its higher available
bandwidth.
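
The arithmetic behind these figures can be checked with a few lines. The memory cycle and word size come from the text; the assumption that an average merge moves 64 bits (one two-word entry read and one written) is made here for illustration, as are the function and variable names.

```python
def bandwidth_mhz(bits, nanoseconds):
    """Bits transferred per microsecond, the unit the text calls MHz."""
    return bits * 1000.0 / nanoseconds

available = bandwidth_mhz(16, 650)          # one 16-bit word per 650 ns cycle
# Assumption: an average merge reads one 32-bit entry and writes one.
effective = bandwidth_mhz(64, 30_000)       # 64 bits per 30 microseconds
efficiency = effective / available

print(f"available  {available:.1f} MHz")
print(f"effective  {effective:.2f} MHz")
print(f"efficiency {efficiency:.1%}")
```

With these inputs the sketch gives an efficiency a little under 9 percent, in line with the figure quoted above; the exact value depends on the merge time, which the text gives only approximately.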

This  low  efficiency  for  conventional  digital  processors  can  be  easily 
explained  by  examining  the  program.  Before  an  instruction  is  executed  it  must 
be  fetched  from  storage,  requiring  an  overhead  memory  cycle.  Because  of  this, 
even  if  every  instruction  completely  processed  a  word  of  the  input  or  output 
data, the efficiency would be only 50%. In addition, on the PDP-11 and many
other  computers,  an  instruction  may  consist  of  more  than  a  single  word, 
reducing  the  efficiency  even  more. 
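
One way to state this bound, using symbols introduced here for illustration only: if each instruction occupies $w$ words and handles exactly one word of data, then

```latex
\[
  \text{efficiency}
  \;=\; \frac{\text{data cycles}}{\text{data cycles} + \text{instruction fetch cycles}}
  \;=\; \frac{1}{1 + w},
\]
```

which gives 50% for single-word instructions ($w = 1$) and less for longer ones.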

In  addition,  there  are  instructions  in  the  program  which  do  not  process 
any  input  or  output  data.  These  can  be  divided  into  two  classes  —  flow  of 
control  and  locating  and  aligning.  The  flow  of  control  instructions  include 
branches  necessary  to  reach  other  statements  of  the  program  either 
conditionally  or  unconditionally.  The  second  class  is  used  to  find  the  next 
input data element in a list or the next available output location (adjustment
of pointers), or to transform data to a form which can be processed by the
machine  (bit  masking,  shifting,  etc.). 

It is clear from the preceding discussion that the problem of merging
lists  of  entries  does  not  nicely  match  a  conventional  digital  computer's 
architecture.  What  is  needed  is  a  processor  which  could  execute  instructions 
more  compatible  with  the  problem,  reducing  both  the  number  of  instructions 
which  must  be  fetched  and  the  need  for  flow  of  control  instructions. 


CHAPTER 2 -- A SPECIALIZED LIST MERGING PROCESSOR


Very  few  types  of  processors  have  been  proposed  to  conveniently  handle 
the  generally  non-numeric  task  of  merging  two  lists  of  data.  Most  non-numeric 
processors  are  associative  processors,  which  are  ideal  for  searching  large 
bodies of data, but not for combining two lists and eliminating unwanted
entries.

The  implementation  of  a  specialized  processor  to  merge  two  input  lists 
into  a  single  output  list  is  simplified  by  the  nature  of  the  problem.  The 
operations  required  are  both  simple  and  well  defined,  allowing  a  hardwired, 
rather  than  programmed,  sequencer  for  speed  and  efficiency.  Operations  such 
as  pointer  and  count  manipulations  can  be  performed  in  parallel  with  the 
actual merging operation, further increasing the speed. Finally, the data
alignment problem present on conventional processors is non-existent, since
the data can easily be routed to the appropriate points in the processor
(assuming the data format is fixed or falls within a small set of previously
defined formats).

2.1   Previous  List  Merging  Processors 

Two  different  styles  of  list  merging  processors  have  previously  been 
proposed:  the  bit  serial/entry  parallel  unit  discussed  by  Stellhorn,  and  the 
HARVEST  non-numeric  extension  to  the  IBM  STRETCH  computer  designed  for  the 
National  Security  Agency. 

Stellhorn [7] proposed using a Batcher merge network [8] to combine the
two input lists (see Figure 2.1). Because it is not practical to build a
Batcher network capable of merging two large lists (for two lists, each
containing N entries, it requires order N log N* Batcher merge elements), a
technique for merging the lists in parts was devised. This consists of
merging  the  next  sublist  from  one  of  the  input  lists,  selected  based  on  the 
lower  first  entry,  with  the  last  half  of  the  results  of  the  previous  merge 
(which  are  fed  back  from  the  outputs  to  the  inputs  of  the  merge  network). 
Stellhorn  proved  that  this  technique  will  always  produce  a  properly  merged 
list. 
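
The feedback scheme can be mimicked in software to see why it works. In the sketch below an ordinary sorted merge stands in for the Batcher network, and the names `blockwise_merge` and `n` (the sublist size accepted by each side of the network) are assumptions for illustration; it is not a model of Stellhorn's hardware.

```python
from heapq import merge

def blockwise_merge(a, b, n):
    """Merge sorted lists a and b using a fixed-size merger of 2*n inputs."""
    out, held = [], []          # held = upper part fed back to the inputs
    ia = ib = 0
    while ia < len(a) or ib < len(b):
        # Select the list whose next sublist has the lower first entry.
        take_a = ib >= len(b) or (ia < len(a) and a[ia] <= b[ib])
        if take_a:
            block, ia = a[ia:ia + n], ia + n
        else:
            block, ib = b[ib:ib + n], ib + n
        merged = list(merge(held, block))   # stand-in for the Batcher network
        cut = max(0, len(merged) - n)
        out.extend(merged[:cut])            # lower part is final output
        held = merged[cut:]                 # upper part is fed back
    out.extend(held)                        # flush when both lists are exhausted
    return out
```

At most `n` of the entries read so far can exceed the smallest unread entry, so everything below the retained upper part is safe to emit after each pass.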

However,  this  network  only  produces  a  list  consisting  of  the  merged 
entries  of  the  two  input  lists;  no  action  is  taken  to  remove  duplicate 
entries  in  an  OR  operation  or,  more  importantly,  to  identify  these  duplicates 
as  the  only  correct  results  of  an  AND.  This  action  must  be  handled  by  an 
additional unit, the coordination network. This unit must examine the entries
and  eliminate  those  which  are  not  proper  results.  It  then  must  repack  the 
data  in  the  output  buffers  and  wait  until  these  buffers  are  full,  because  some 
of  the  entries  from  the  Batcher  merge  network  may  have  been  eliminated. 
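
What the coordination step has to do to the merged stream can be sketched as follows, assuming each input list is sorted and contains no internal duplicates, so that any duplicate in the merged stream comes from one entry in each list. The function name `coordinate` is hypothetical.

```python
def coordinate(merged, op):
    """Reduce a merged, sorted stream to the result of OR or AND."""
    out, k = [], 0
    while k < len(merged):
        duplicate = k + 1 < len(merged) and merged[k + 1] == merged[k]
        if op == "OR" or duplicate:    # AND keeps only the duplicated entries
            out.append(merged[k])
        k += 2 if duplicate else 1     # a duplicate pair yields one entry
    return out
```

Either way the output is shorter than the merged input whenever duplicates occur (and much shorter for AND), which is why the surviving entries must be repacked before the output buffers can be filled.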

It  is  possible,  when  a  large  number  of  list  entries  are  being  processed 
in  parallel,  for  either  the  processing  time  (with  the  unit  proposed  by 
Stellhorn)  or  the  number  of  gates  (as  proposed  by  Lawrie  [9,10])  of  the 
coordination  network  to  be  greater  than  that  of  the  Batcher  merge  network! 


* In all instances in this thesis, log n will mean the logarithm to base two
of n, if n is an integral power of two, or the logarithm of the next higher
power of two, if it isn't.

The second form of merge processor is similar to the IBM 7950 HARVEST
processor,  an  extension  to  the  IBM  7030  STRETCH  system  [11].  Figure  2.2  is  a 
simplified  diagram  of  the  functional  units  within  the  processor.  HARVEST  is 
programmed  by  having  STRETCH  pass  it  a  list  of  setup  instructions.  When  the 
processor  has  been  successfully  programmed  by  these  instructions,  a  start 
command  is  issued  by  STRETCH,  and  HARVEST  processes  the  streams  of  data  based 
on  the  instructions.  Facilities  exist  for  the  transformation  of  data 
controlled  by  table  lookup,  in  addition  to  logical  transformation.  This  table 
lookup  scheme  causes  the  processor  to  fetch  data  from  the  current  table  based 
on  a  function  of  the  input  characters.  This  data  can  consist  of  an  arbitrary 
number  of  output  characters,  including  none,  and  an  address  for  the  table  to 
be  used  next. 

2.2 A Simple List Merging System

Figure 2.3 shows the major data paths connecting the components of a
large scale data processing system, such as the IBM System/360 Model 75. The
large, high bandwidth memory is connected to the various processors by a
memory bus control unit (BCU), which acts as an arbitrator between the
potential memory users. The channels have the highest priority access to the
memory, and the central processor the lowest. The channels are used to
relieve the need for character assembly by the central processor, and to
better match the high bandwidth of the memory to the low bandwidth of the
peripheral units. In a smaller system, the BCU is replaced by a simple bus
arbitration protocol, and the channels by direct memory access capability
built into the control units which transfer large amounts of data.
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The  merge  processor  is  added  as  if  it  were  an  additional  processor  or 
channel,  with  control  information  exchanged  between  the  central  processor  and 
the  merge  processor,  and  the  merge  processor  transferring  data  to  and  from 
memory  thru  the  BCU.  In  this  configuration,  the  central  processor  issues  a 
command  to  the  merge  processor  indicating  the  memory  locations  for  the  input 
and  output  lists  and  the  type  of  operation  desired.  The  central  processor  can 
then  execute  other  tasks  until  the  merge  processor  completes  the  operation  or 
detects  an  error;  at  this  point  it  will  interrupt  the  central  processor  for  a 
new  command.  Due  to  the  high  data  requirements  of  the  merge  processor,  as 
compared  to  normal  input/output  devices,  and  the  fact  that  the  bus  control 
unit  grants  access  to  the  central  processor  only  when  another  unit  is  not 
requesting  it,  the  merge  processor  can  effectively  halt  execution  of  the 
program  running  on  the  central  processor  if  care  is  not  taken  to  periodically 
relinquish  memory  ownership  to  the  central  processor. 

A  more  complex  merge  processing  system  is  illustrated  in  Figure  2.4. 
This  system  contains  its  own  memory  and  disk  files,  so  its  interference  during 
operation  of  the  conventional  data  processing  system  is  minimized.  It 
consists  of  a  merge  processor,  a  disk  system,  a  channel  interface,  memory,  and 
a  scheduling  processor.  This  scheduling  processor  receives  requests  from  the 
host  processor,  queues  them  until  the  appropriate  resources  are  available, 
fetches  the  data  from  disk  into  memory,  merges  the  entries,  and  transfers  the 
result  either  to  disk  for  later  usage  or  to  the  host  processor  via  the  channel 
interface.  To  the  host  system,  this  configuration  appears  to  be  a  very 
intelligent  disk  system  which  has  all  possible  combinations  of  lists  stored. 
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The  merge  processor  can  be  implemented  either  as  a  parallel  or  serial 
unit. As is generally true, the parallel unit operates considerably faster,
but requires an increase in gates greater than the increase in its speed over
the serial unit. However, the parallel unit is easier to understand, and will
be discussed first.

2.3  Parallel  Element  Implementation 

A  general  block  diagram  of  the  parallel  merge  processor  [12,13]  is  given 
in  Figure  2.5.  Data  is  fetched  from  memory  by  either  the  X  or  Y  list  fetch 
logic  and  delivered  to  the  appropriate  mask  checker.  Here  the  context  bits 
are  checked  using  the  specified  mask  to  determine  if  the  entry  is  for  an  item 
in  the  proper  context;  if  it  is,  the  entry  is  placed  in  the  appropriate  input 
holding  register  and  that  register  is  marked  full*.  Fetching  of  list  entries 
continues  until  both  holding  registers  are  full.  At  this  time,  the  two 
document  number  fields  are  compared,  and  the  action  to  be  taken  is  determined 
based on the operation specified. This consists of forming the output fields,
marking the output register as full if the operations table specifies the
creation of an output entry, and indicating that either or both of the input
registers are empty. This action continues until the lists are exhausted**.


*  The  merge  processor  and  the  memory  interface  are  interlocked  using  a  bit 
for  each  of  the  inputs  and  for  the  output.  These  bits  are  set  by  the  data 
source  when  it  places  data  in  the  buffer,  to  indicate  the  connection  is 
full,  and  reset  by  the  data  sink  to  indicate  the  connection  is  empty,  and 
new data should be placed in it.

**  In  the  case  of  A  AND  B,  processing  can  be  stopped  when  either  list  A  or 
list  B  is  exhausted,  rather  than  waiting  for  both  lists  to  be  exhausted. 
For  A  AND  NOT  B,  it  can  be  stopped  when  list  A  is  exhausted.  The  amount  of 
time  this  saves  is  highly  data  dependent,  and  will  be  ignored  in  future 
discussions. 
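
The fetch, mask check, compare, and output cycle of the parallel element can be sketched as follows. Several details are assumptions made for illustration: the operations table is not reproduced in this section, so the usual AND, OR, and AND NOT merge rules are used; an entry is taken to be in context when `context & mask` is nonzero; and the names `merge_lists` and `fetch` are hypothetical.

```python
def merge_lists(x_list, y_list, op, x_mask=~0, y_mask=~0):
    """Merge two sorted lists of (document number, context bits) entries."""
    def fetch(entries, mask):
        # Fetch logic plus mask checker: skip entries outside the context.
        for doc, ctx in entries:
            if ctx & mask:
                yield doc

    END = float("inf")                    # marks an exhausted list
    xs, ys = fetch(x_list, x_mask), fetch(y_list, y_mask)
    x, y = next(xs, END), next(ys, END)   # fill both holding registers
    out = []
    while x != END or y != END:
        xlow, ylow = x <= y, y <= x       # document number comparator
        if op == "AND":
            emit = xlow and ylow          # output only on a match
        elif op == "OR":
            emit = True                   # output the lower entry
        else:                             # "AND NOT"
            emit = xlow and not ylow      # X entry with no match in Y
        if emit:
            out.append(min(x, y))
        if xlow:                          # mark X register empty and refill
            x = next(xs, END)
        if ylow:                          # mark Y register empty and refill
            y = next(ys, END)
    return out
```

This sketch always runs until both lists are exhausted and does not model the early stopping mentioned in the second footnote.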
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The  major  unit  in  the  element  is  the  document  number  comparator,  which  is 
used  to  decide  which  input  holding  registers  should  be  marked  as  empty  and  how 
the  output  entry  should  be  formed.  Figure  2.6  illustrates  a  ripple-carry 
design  for  this  comparator,  including  the  equations  for  each  stage  and  the 
approximate  number  of  gates  and  the  delay.  The  output  XLOW  is  true  (high)  if 
the  document  number  in  the  X  input  register  is  less  than  or  equal  to  the  one 
in  the  Y  register;  YLOW  is  true  if  Y  is  less  than  or  equal  to  X.  For  higher 
speed operation, a parallel comparator similar to the SN7485 MSI unit [14] can
be  used.  The  document  number  in  the  output  entry  is  generated  by  using  a  two 
input  selector  driven  by  the  document  number  fields  of  the  two  input  list 
entries,  and  using  either  the  XLOW  or  YLOW  signal  as  required  to  select  the 
lower  document  number. 

Figure  2.7  shows  the  simple  ALU/selector  used  to  form  the  output  count 
field.  A  comparator  examines  the  two  input  count  fields  to  determine  the 
lesser,  which  is  selected  if  an  AND  operation  is  being  performed.  Since  only 
one  of  the  four  AND  gates  is  selected  in  this  case,  the  adder  passes  the 
desired count field directly to the output. If either an OR or an AND NOT
operation is specified, the proper inputs to the adder are selected based on
the XLOW and YLOW control signals: if the appropriate nLOW control signal is
true, the Cnt(n) data is fed to the adder, while if nLOW is false, a zero is
sent. Other output generators, such as for the tag bit, can be added in a
similar  fashion. 
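
The behavior of the count ALU/selector described above can be summarized in a few lines. This is a behavioral sketch only, not the gate-level design of Figure 2.7; the function name `output_count` and its argument names are hypothetical.

```python
def output_count(op, cnt_x, cnt_y, xlow, ylow):
    """Form the output count field from the two input count fields."""
    if op == "AND":
        # The comparator selects the lesser count, which passes through the adder.
        return min(cnt_x, cnt_y)
    # OR and AND NOT: each count is gated into the adder by its nLOW signal,
    # and a zero is sent in its place when the signal is false.
    return (cnt_x if xlow else 0) + (cnt_y if ylow else 0)
```

For an OR in which both document numbers are equal, both XLOW and YLOW are true and the two counts are summed; when only one input is low, its count passes through unchanged.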

2.4 Serial Element Implementation

Figure  2.8  is  a  block  diagram  of  the  serial  implementation  of  the 
processor.  The  data  is  placed  in  parallel  into  the  X  or  Y  shift  registers 
from memory. From here, it is sent one bit at a time to the merge element,
and  is  also  routed  back  to  the  shift  register  in  case  the  data  is  needed  by 
the  merge  element  on  the  next  cycle.  When  the  control  determines  that  a 
register  is  empty,  new  data  is  loaded  into  it  in  parallel  replacing  the  old 
data  collected  from  the  feedback  path. 

For  the  serial  element  to  function  properly,  it  must  be  sent  the  most 
significant  bit  of  the  document  number  first.  If  other  fields  are  to  be 
processed,  the  order  must  be:  document  number,  context  field,  tag,  and  count, 
with  all  numbers  sent  most  significant  digit  first.  Since  this  order  can  be 
produced  by  the  connection  between  the  holding  shift  registers  and  the  memory, 
there  are  no  special  format  requirements  for  data  in  the  memory. 

The  basic  logic  of  the  serial  element  is  shown  in  Figure  2.9,  and 
consists  of  two  flip-flops  (5  and  6,  and  8  and  9),  logic  to  set  these 
flip-flops,  and  a  network  to  select  one  of  the  inputs  based  on  the  state  of 
the  flip-flops.  At  the  start  of  the  operation,  both  flip-flops  are  reset, 
allowing  both  inputs  to  pass  thru  gates  1,  2,  and  3  to  the  output.  This 
continues  as  long  as  both  input  bit  streams  are  equal,  since  the  output  of 
gate  ?  inhibits  either  gate  4  or  gate  7  from  setting  the  flip-flops.  However, 
when  the  first  difference  in  the  inputs  occurs,  the  output  bit  is  forced  to  be 
a  zero  (since  this  is  the  bit  in  the  lowest  entry  at  this  time)  and  one  of  the 
flip-flops  is  set.  This  causes  the  output  to  follow  the  input  which  contained 
the  zero  bit. 
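
The selection behavior of the serial element can be simulated one bit at a time. The sketch below models only the behavior described in the text, not the actual gates of Figure 2.9; the names `to_bits`, `serial_select_lower`, `x_high`, and `y_high` are hypothetical, and the assumption is that each flip-flop records which input was found to be the higher one.

```python
def to_bits(value, width):
    """Return the bits of value, most significant bit first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]

def serial_select_lower(x_bits, y_bits):
    """Pass the lower of two bit streams (sent MSB first) to the output."""
    x_high = y_high = False          # both flip-flops reset at the start
    out = []
    for x, y in zip(x_bits, y_bits):
        if x_high:
            out.append(y)            # X was higher: output follows Y
        elif y_high:
            out.append(x)            # Y was higher: output follows X
        else:
            # Streams equal so far: the AND of the inputs equals either bit
            # while they agree, and is zero at the first difference.
            out.append(x & y)
            if x != y:
                if x:
                    x_high = True    # X holds the one bit, so X is higher
                else:
                    y_high = True
    xlow, ylow = not x_high, not y_high
    return out, xlow, ylow
```

If the two streams are identical, neither flip-flop is set and both XLOW and YLOW are reported as true, matching the less-than-or-equal definition used for the parallel comparator.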
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The  addition  of  extra  fields  to  the  serial  element  is  not  as  simple  as  it 
was  for  the  parallel  element.  Figure  2.10  illustrates  the  complete  data 
section  for  a  unit  which  handles  count,  tag,  and  context  flags  in  addition  to 
the  document  number.  Gates  10,  11,  12,  and  13  are  used  to  examine  the  context 
flags  against  two  masks  sent  bit  serially  from  the  control  section.  If  a 
match  failure  occurs,  the  CONTEXT  signal  is  reset  and  XLOW  or  YLOW  is  set  to 
indicate that the input is empty. A conventional serial adder (gates 14, 15,
16,  17,  18,  and  19)  is  used  to  form  the  sum  as  required  for  the  count  field  in 
an  AND  operation.  For  an  OR  or  AND  NOT  operation,  the  basic  network  is  used 
to  form  the  minimum  or  select  the  desired  entry. 

The actual output bit is selected by gates 20, 21, and 22. In the case
of the tag bit, the selection of the AND of the input tags required by the OR
operation can be done using gate 3, while the OR necessary for the AND
operation can be produced using both 3 and 19. (Remember that A OR B is
equivalent to A XOR B OR A AND B.) Figure 2.11 is a flow diagram illustrating
the order in which these control signals are generated for the different
operations.

Figure  2.12  summarizes  the  time  required  by  both  a  fully  parallel  and  a 
bit  serial  implementation  of  the  merge  processor  to  handle  entries  of 
representative  sizes.  Both  standard  TTL  logic  (11  ns  nominal  gate  delay)  and 
Schottky-clamped TTL (3 ns gate delay) speeds are given. Correspondingly
higher speeds would be available using ECL 10000 or other logic families with
gate delays of 1 ns or less. In the case of the parallel implementation,
speeds for similar MSI devices were used instead of calculating the gate
delays for the selectors, comparators, and adders.
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The  bandwidths  of  various  memory  devices  are  also  given  in  Figure  2.12. 
These are simply the number of bits they can supply in one second. For the
word oriented memories, such as processor core memories, this is the word
length divided by the cycle time. The bandwidth for a merge processor is
twice the number of bits processed in one second, since it is assumed that, on
the average, each cycle will use one input entry and produce one output entry.
The clock cycles for the serial processors are 30 ns and 110 ns, to allow for
cable delays and control signals outside the merge processor's data section.

As  the  figure  illustrates,  even  the  serial  Schottky  TTL  processor  is 
faster  than  the  memory  on  the  PDP-11.  Both  parallel  implementations  are 
considerably  faster  than  the  processors'  memories,  and  up  to  200  times  faster 
than  the  3330-type  disk.  Since  conventional  computers  used  only  about  one  out 
of  fifteen  memory  cycles  to  actually  process  the  list  entries,  given 
sufficiently  fast  local  memory,  a  merge  processor  system  can  operate  over  one 
hundred  times  faster  than  conventional  systems. 
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Many  times,  a  query  consists  of  more  than  two  terms,  combined  using  a 
complex expression of AND's, OR's, and NOT's. It is highly desirable to be
able  to  process  a  query  of  this  type  directly,  both  to  save  time  and  to  reduce 
the  requirement  for  storing  intermediate  results.  This  can  be  done  by  taking 
the  simple  merge  element  considered  in  Chapter  2,  making  minor  modifications 
to  the  design,  and  connecting  a  number  of  them  in  a  network  which  directly 
produces  the  desired  result. 

The  network  at  the  top  of  Figure  3.1  illustrates  how  AND,  NOT,  and  OR 
elements  can  be  connected  to  form  a  more  complex  expression.  Although  the 
network  performs  its  functions  on  list  entries,  creating  new  intermediate 
lists,  it  can  be  considered  much  the  same  as  Boolean  combinational  logic,  with 
the  same  laws  (association,  distribution,  etc.)  applying  to  the  expression. 
For  example,  the  network  which  is  described  by  (A  OR  B)  AND  (C  OR  D)  produces 
the same results as (A AND C) OR (A AND D) OR (B AND C) OR (B AND D).
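
The equivalence of the two forms can be checked on small example lists. The lists below are arbitrary values invented for this illustration, and Python sets stand in for the ordered lists handled by the network.

```python
# Hypothetical index lists, represented as sets of document numbers.
A, B, C, D = {1, 4, 7}, {2, 4, 9}, {4, 5, 9}, {1, 6, 7}

# (A OR B) AND (C OR D)
lhs = sorted((A | B) & (C | D))

# (A AND C) OR (A AND D) OR (B AND C) OR (B AND D)
rhs = sorted((A & C) | (A & D) | (B & C) | (B & D))

same = lhs == rhs   # the distributive law guarantees this is True
```

The first form needs three elements and the second needs seven, so the choice between equivalent forms affects how much of the network an expression occupies.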

3.1 Network Bandwidth Considerations

In  this  chapter,  the  network  is  assumed  to  operate  in  a  synchronized, 
pipelined  fashion,  with  the  memory  able  to  either  fetch  or  store  a  single  list 
entry  each  network  cycle.  In  most  cases,  this  assumption  regarding  the  memory 
speed  is  valid,  since  the  simple  design  of  the  merge  processor  allows  for  very 
high  speed  operation.  As  can  be  seen  in  Figure  2.12,  even  a  serial  merge 
processor  implemented  in  conventional  7400  series  TTL  logic  has  a  higher 
bandwidth than the memory of a DEC PDP-11/40, and the parallel processor
constructed using Schottky TTL has a bandwidth from ?. to 10 times that of the
interleaved memory on the IBM System/360 Model 75. Chapter 4 will briefly
discuss some of the results obtained when the memory speed of the network is
some multiple of the element speed greater than one.

Figure 3.1 - Network to Process a Complex Expression

During  each  cycle,  every  network  element  examines  its  two  input  entries 
and forms an output according to the rules given in Chapter 1. If the
connection  to  its  successor  is  already  full  from  a  previous  cycle  and  the 
operation  produces  a  new  output,  it  waits  for  a  future  cycle  when  the 
successor  connection  is  empty.  Otherwise,  the  element  places  its  result,  if 
any,  on  the  connection,  and  marks  its  inputs  empty  as  indicated  by  Figure 
1.10.   The  table  in  Figure  3.1  illustrates  how  data  flows  thru  the  network. 

This  operation  continues  until  an  end-of-processing  entry,  which  is  a 
number  greater  than  the  highest  item  number  allowed,  reaches  the  output 
connection.  This  entry  is  introduced  to  the  network  when  the  fetch  logic 
detects  the  end  of  a  list.  An  OR  element  outputs  this  special  entry  when  both 
inputs are equal to it, an AND when either input is equal to it (since there
will be no use trying to match inputs if one of the input lists is exhausted),
and  an  AND  NOT  when  the  X  input  contains  the  special  entry. 
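
The rules for propagating the end-of-processing entry can be stated compactly. In this sketch the constant `END` and the function name `emits_end` are hypothetical; `float("inf")` merely stands in for a number greater than the highest item number allowed.

```python
END = float("inf")   # stands in for a value above the highest item number

def emits_end(op, x, y):
    """Return True when an element should output the end-of-processing entry."""
    if op == "OR":
        return x == END and y == END   # both input lists are exhausted
    if op == "AND":
        return x == END or y == END    # no further matches are possible
    if op == "AND NOT":
        return x == END                # only the X list can produce output
    raise ValueError("unknown operation: " + op)
```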

Why  should  a  number  of  merge  processors  be  connected  together  to  form  a 
network  when  the  memory  is  capable  only  of  fetching  or  storing  a  single  entry 
in  the  time  that  it  takes  the  element  to  perform  its  operation?  Because  the 
processing  of  any  expression  involving  more  than  two  inputs  requires  the 
generation  of  intermediate  results,  which  must  also  be  both  stored  and 
refetched from memory. This reduces the number of useful memory cycles
available  in  a  unit  of  time  (the  memory's  effective  bandwidth),  increasing  the 
time  required  to  process  the  expression. 
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In  contrast,  if  the  expression  is  processed  directly  by  a  network  of 
merge  elements,  no  intermediate  results  are  generated.  The  time  required  for 
the  network  to  process  an  expression  in  this  case  is  approximately  equal  to 
the sum of the lengths of the input lists and the output list, multiplied by
the  memory  cycle  time.  If  intermediate  results  are  produced,  the  time 
required  is  this  basic  time  plus  twice  the  length  of  the  intermediate  results 
(since  they  are  both  stored  and  fetched). 

The  lengths  of  the  intermediate  results  are  dependent  upon  the  lengths  of 
the input lists and the amount of overlap between these inputs. If the
operation  is  to  form  the  AND  of  a  number  of  lists  with  negligible  overlap,  the 
size  of  the  intermediate  lists  will  be  small  and  the  savings  insignificant. 
However,  if  the  operation  is  the  OR  of  N  input  lists,  again  with  negligible 
overlap  and  length  L,  the  first  stage  in  the  tree  will  have  a  total  input 
length of N L, and an output length of (2 N L)/2, or N L. In fact, each stage
will have inputs and outputs equal in length to N L. Since there are log N
stages  in  the  tree,  the  total  number  of  memory  cycles  required  is  2  N  L  log  N, 
with  only  2  N  L  being  used  for  fetching  input  lists  or  storing  the  output 
list.  The  effective  increase  in  bandwidth  using  a  tree  structure  which  does 
not  require  the  storing  and  refetching  of  intermediate  results  is  simply  the 
ratio  of  these  two  quantities,  or  log  N. 

In  other  words,  for  an  expression  containing  sixteen  input  lists,  it  is 
possible  to  increase  memory  performance  by  a  factor  of  one  to  four  by  simply 
using fourteen additional low-cost processing elements.
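
The memory cycle counts for the OR of N lists can be worked through numerically. The sketch assumes, as in the text, N lists of equal length L with negligible overlap and N a power of two; the function names and the example length of 1000 entries are hypothetical.

```python
def ceil_log2(n):
    """Base-two logarithm of n, rounded up to the next power of two."""
    return (n - 1).bit_length()

def or_tree_memory_cycles(n_lists, length):
    """Return (cycles with a network, cycles with stored intermediates)."""
    stages = ceil_log2(n_lists)
    # With a network, only the inputs are fetched and the output is stored.
    direct = 2 * n_lists * length
    # Without one, every stage fetches N*L entries and stores N*L entries.
    with_intermediates = direct * stages
    return direct, with_intermediates

direct, staged = or_tree_memory_cycles(16, 1000)
gain = staged // direct   # effective bandwidth increase, log N
```

For sixteen lists this gives a ratio of four, the upper end of the range quoted above; overlap between the lists, or AND operations, shortens the intermediate lists and reduces the gain.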
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A  secondary  benefit  occurs  if  a  list  is  used  in  more  than  one  place  in  an 
expression.  Using  conventional  processing,  this  list  must  be  fetched  each 
place  it  is  used.  However,  it  is  only  necessary  to  fetch  it  once  for  all  the 
times  it  is  used  in  the  network,  again  reducing  the  number  of  memory  cycles 
required  to  process  an  expression. 

This  effective  increase  in  memory  speed  allows  two  options.  Either  the 
system  can  process  a  user's  request  faster,  thereby  shortening  response  times 
or  allowing  more  simultaneous  users  of  the  system,  or  if  this  higher  speed  is 
not  necessary,  low  speed  memory  can  be  used  to  give  speeds  comparable  to  a 
single processor using high speed memory. If a lower speed memory is
utilized,  this  would  allow  the  purchase  of  a  larger  memory  at  the  same  cost. 
With  this  larger  memory  available,  it  is  possible  that  fewer  of  the  final 
results  will  have  to  be  placed  on  the  disk,  being  saved  in  the  high  speed 
memory  for  use  with  later  queries.  This  further  increases  the  speed  of  the 
system  by  reducing  the  number  of  lists  which  must  be  stored  and  fetched  from 
the  comparatively  slow  disk  memory. 

3.2  Merge  Network  Hardware  Considerations 

There  are  two  reasonable  forms  for  the  network  to  take  —  either  the 
connections  between  the  elements  are  fixed,  or  they  can  be  changed  under 
control  of  the  host  computer.  In  either  case,  the  function  of  the  individual 
elements  is  controlled  by  setup  instructions  from  the  host  computer. 

For  a  network  with  N  inputs,  N  -  1  processing  elements  are  required. 
However,  if  an  Omega  network  [15]  is  used  to  interconnect  the  individual 
elements  and  the  memory,  order  N  log  N  switching  elements  are  required.   Other 
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switching  networks  require  at  least  as  many  switching  elements  as  the  Omega 
network.  For  large  networks,  the  number  of  gates  in  the  switching  network  is 
more  than  in  all  the  processing  elements.  Because  of  this,  only  the  fixed 
network will be discussed in this chapter.

Since  each  element  has  two  inputs  and  one  output,  the  network  takes  the 
form of a binary tree. This means that if a network has a possible N inputs
(where N is a power of two), it contains N - 1 processing elements in log N
stages. Figure 3.2 illustrates this binary tree, and some of the terms used
when referring to parts of the tree and the individual elements. Notice that
the  only  processing  element  inputs  and  outputs  which  are  connected  to  the 
memory  are  at  the  extreme  ends  of  the  tree.   Within  the  tree,   an  element's 
output  can  only  be  connected  to  another  element's  input.   Since  it  is  probable 
that  the  expression  specified  by  the  user  does  not  have  exactly  N  input  lists, 
some  method  must  exist  to  accommodate  it  without  the  necessity  of  expanding  or 
contracting  it  to  N  inputs.   This  can  be  done  by  defining  a  fourth  operation 
(in  addition  to  AND,  OR,  and  AND  NOT)  which  an  element  can  perform.   This  is 
the PASS operation, which transfers data from the X-input, if it is full, to
the  output,  if  the  output  connection  is  empty,  and  marks  the  input  empty  when 
the  transfer  occurs.   The  Y-input  is  marked  empty  unconditionally  each  cycle 
to  prevent  an  improperly  specified  operation  from  blocking  the  network. 
However,  any  connection  which  has  an  input  list  as  a  predecessor  should  not  be 
used  as   a  Y-input  to  the  PASS  element.   Figure  3.3  illustrates  the  operation 
of  a  network  which  includes  a  PASS  element. 
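
One cycle of the PASS operation can be sketched as a function on the three connections of an element. The representation is an assumption made for illustration: `None` marks an empty connection, and the names `EMPTY` and `pass_step` are hypothetical.

```python
EMPTY = None   # marks a connection that holds no entry

def pass_step(x, y, out):
    """Perform one cycle of a PASS element; return the new (x, y, out)."""
    if x is not EMPTY and out is EMPTY:
        # Transfer the X-input to the output and mark the input empty.
        out, x = x, EMPTY
    # The Y-input is marked empty unconditionally each cycle.
    y = EMPTY
    return x, y, out
```

Because the Y-input is discarded every cycle, anything arriving there is lost, which is why an input list should not be routed to the Y-input of a PASS element.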
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R  =  S  +  T  +  U 
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Figure  3.3  -  Network  with  PASS  Element 
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Operation  of  the  network  is  controlled  by  the  host  system.  Commands  are 
sent from the host system to the controllers for the processing elements to
indicate which of the four operations each should perform. The starting
addresses in memory and the lengths of the input lists and the output area are
passed to the memory interface. Finally, all connections are marked empty,
and  the  processing  started.  If  an  error  is  encountered,  or  when  processing 
has  finished,  the  host  processor  is  interrupted,  and  the  network  waits  for  the 
next  command. 

Memory  operation  is  a  function  of  the  output  being  full  or  an  input 
empty.  For  each  cycle  of  the  network,  the  output  connection  is  checked  and 
the  output  list  element  stored  if  the  connection  is  full.  If  no  output  was 
stored,  the  distributor  finds  an  input  which  is  empty,  and  fetches  the  next 
entry  of  that  input  list,  marking  the  input  connection  full.  If  a  list  is  the 
input  to  more  than  one  network  element,  the  distributor  does  not  fetch  a  new 
entry  until  all  elements  which  use  the  input  have  marked  it  empty.  Since  the 
lists are long (and of course if they weren't, this special
hardware would not be necessary), no special priority scheme to decide which
list  entry  to  fetch  next  is  necessary.  For  any  scheme  used,  the  time  required 
to  evaluate  the  expression  only  differs  by  the  time  required  to  flush  out  the 
pipeline  when  the  ends  of  the  input  lists  are  reached,  which  is  proportional 
to  the  height  of  the  tree. 

Figure  3.4  summarizes  the  number  of  gates  required  to  construct  the  tree 
using  serial  and  parallel  processing  elements.  The  time  required  for  the 
network  to  process  an  entry  is  approximately  that  given  for  a  single  processor 
in  Figure  2.12,  since  the  network  operates  as  a  pipelined  processor. 
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N  term  expression  (which  requires  N-1  processing  elements) 
X bit list entry

Total gates in network = (N-1)proc + X(N+1)mem + (N-2)conn + control

control = number of gates in central control and interface to host
          system (depends greatly on host system requirements)

conn    = number of gates in inter-element connection
        = O(2)

mem     = number of gates in memory interface
        = O(2X)

proc    = number of gates in processing element and its local control
        = O(30) for serial processing element
        = O(10X) for parallel processing element using ripple-carry
          comparators and adders (more if carry-lookahead used)

Figure 3.4 - Network Gate Counts
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3.3   A  Serious  Problem 

Unfortunately,  the  network  structure  presented  above  will  not  work  in  all 
cases.  For  example,  Figure  3.5  illustrates  a  simple  network  which  produces  (L 
AND M) OR (L AND N) OR (M AND N). For clarity, the portion of the network
consisting  of  PASS  elements  has  been  deleted.  The  value  X  indicates  that  the 
connection is empty. It is clear from the table of connection values vs.
time  that  the  tree  rapidly  reaches  a  point  where  no  changes  occur.  Element  E3 
is prevented from placing its next result on C3. It therefore cannot mark
either  of  its  inputs  M  and  N  empty.  However,  new  entries  from  list  M  and  list 
N  must  be  fetched  before  E1  and  E2  can  produce  results.  These  in  turn  are 
needed  by  E4  and  E5  to  empty  C3.   The  network  is  hopelessly  deadlocked. 

This  deadlock  only  occurs  in  a  network  which  contains  AND  or  AND  NOT 
processing  elements,  and  only  if  an  input  list  is  used  as  the  input  to  more 
than one processor. A network which does not have any shared input lists
cannot  deadlock,  since  if  one  input  connection  to  an  element  is  full,  the 
other  is  capable  of  supplying  entries  until  the  full  connection  can  be  marked 
as  empty. 

This  problem  can  be  solved  by  modifying  the  operation  of  the  element  and 
extending  the  concept  of  a  connection  being  full  or  empty.  Instead  of  a 
connection  being  full,  it  will  be  called  valid.  The  term  empty  will  be 
replaced  by  invalid.  The  difference  is  that  invalid  data  on  a  connection  has 
a  definite  value,  while  the  value  of  an  empty  connection  is  undefined.  Figure 
3.6  is  the  operations  table,  similar  to  the  one  in  Figure  1.10,  for  the 
modified  merge  processing  element.  It  shows  the  input  and  output  control  and 
the formation of the document number field. The count and tag fields are
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produced  as  before,  except  in  the  case  where  the  two  inputs  are  equal.  Here, 
an  input  is  only  used  to  form  the  output  field  if  it  is  valid.  The  major 
change  consists  of  unconditionally  placing  the  value  of  the  lower  input  on  the 
output  connection  if  the  output  connection  is  free  (does  not  contain  valid 
data). In addition, the element compares the two inputs regardless of whether
they are both valid, and sets the validity of the output based not only
on the relative magnitudes of the two inputs, but also on their validity.
Figure 3.7 shows the operation of the network in Figure 3.5, but with the
modified element operation. A value in parentheses indicates that that value
is  invalid. 

A  modified  network  is  one  constructed  from  processing  elements  following 
the  operations  table  given  in  Figure  3.6.  An  unmodified  network  is  one  formed 
according to the original operations table in Figure 1.10. The following results show
that  a  modified  network  will  never  deadlock,  and  will  produce  valid  output 
list  entries  identical  to  those  of  the  unmodified  network. 

Lemma  The  output  of  any  subtree  of  a  modified  network  consists  of  the 
union  of  its  input  lists,  in  increasing  order  with  all  duplicates 
removed.  Furthermore,  if  an  output  list  entry  is  invalid,  it  cannot 
become valid at some later time.

Proof  First  consider  a  subtree  consisting  of  a  single  merge  processing 
element  whose  inputs  are  the  input  lists.  By  the  operations  table,  it 
takes  the  lower  of  its  two  inputs  and  places  it  on  its  output.  It  then 
gets  a  higher  valued  input  to  replace  the  one  transferred  to  the  output. 
Therefore,  its  output  is  the  union  of  its  two  inputs,  in  increasing 
order. If the same entry occurs in both its inputs, only one output
entry is produced, so that no duplicates exist in the output. Finally,
once  an  output  has  been  produced,  the  input  that  was  used  to  form  it  is 
replaced  by  a  higher  valued  entry,  so  that  any  future  output  entry,  valid 
or  invalid,  must  be  greater  than  the  current  output  entry. 

Assume  the  Lemma  is  true  for  all  subtrees  consisting  of  N  stages.  A 
subtree  of  N  +  1  stages  consists  of  a  single  merge  processing  element, 
with  two  subtrees  of  N  stages  as  its  inputs.  Since  the  outputs  of  these 
two  subtrees  are  assumed  to  be  in  the  form  given  by  the  Lemma,  which  is 
the  same  form  as  for  an  input  list,  the  arguments  given  for  the  single 
element  subtree  also  hold  for  the  final  element  in  the  N  +  1  stage 
subtree.  Therefore,  the  output  of  this  subtree  is  in  the  form  given  by 
the  Lemma.   By  induction,  the  Lemma  is  true. 
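
The behavior stated by the Lemma can be illustrated in software. The following is only a sketch, not the hardware operations table of Figure 3.6: it assumes each input list is strictly increasing, uses infinity as a stand-in for the end-of-list marker, and the names or_element and or_tree are hypothetical.

```python
from typing import Iterable, Iterator, List

END = float("inf")  # assumed stand-in for the end-of-list marker


def or_element(left: Iterable[int], right: Iterable[int]) -> Iterator[int]:
    """Merge two strictly increasing lists into their union, each value once."""
    a, b = iter(left), iter(right)
    x, y = next(a, END), next(b, END)
    while x != END or y != END:
        low = min(x, y)
        yield low
        # Replace every input that supplied the value just emitted, so that
        # any later output must be greater than the current one.
        if x == low:
            x = next(a, END)
        if y == low:
            y = next(b, END)


def or_tree(lists: List[List[int]]) -> List[int]:
    """Combine the input lists pairwise, stage by stage, as a binary tree."""
    streams: List[Iterable[int]] = list(lists)
    while len(streams) > 1:
        paired: List[Iterable[int]] = [
            or_element(streams[i], streams[i + 1])
            for i in range(0, len(streams) - 1, 2)
        ]
        if len(streams) % 2:  # an unpaired stream passes on to the next stage
            paired.append(streams[-1])
        streams = paired
    return list(streams[0]) if streams else []


merged = or_tree([[1, 3, 5], [2, 3, 6], [1, 7], [4]])
```

Because the output of each or_element has the same form as an input list, the elements compose stage by stage, which mirrors the induction step of the proof.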

Lemma If an N stage subtree of a modified network is not deadlocked,
and its output connection remains free (either because no valid items are
placed on it or because the successor element of the subtree immediately
marks it as invalid), the lowest of its inputs is transferred to its output
in not more than N cycles.

Proof  For  a  subtree  consisting  of  a  single  element,  the  Lemma  is 
obvious.  Assume  the  Lemma  is  true  for  all  subtrees  of  N  -  1  stages. 
Remember  that  a  tree  of  N  stages  consists  of  two  subtrees  of  N  -  1  stages 
as inputs to a single element at stage N. After N - 1 cycles, each of
these  subtrees  has  transferred  the  lowest  of  its  inputs  to  its  output. 
At  cycle  N,  the  final  stage's  element  takes  the  lower  of  these  two 
subtree  outputs,  which  is  the  lowest  of  the  inputs  to  both  subtrees,  and 
places  it  on  its  output  connection.  Therefore,  the  Lemma  holds  for  all 
subtrees of N stages, and by induction is true.

Theorem A modified network cannot deadlock.

Proof Assume the network is deadlocked. Then one or more
elements are unable to place a valid entry on their output connections
because a previous valid entry has not been marked invalid by the element
which has the connection for an input. Furthermore, this condition has
been in existence for an arbitrary length of time. Such a connection is
called a blocked connection.

Consider  the  blocked  connection  B  closest  to  the  input  end  of  the 
tree.  The  subtree  which  has  B  as  its  output  connection  will  be  called  X, 
the  element  with  B  as  its  input,  E,  and  the  subtree  which  feeds  E's  other 
input  will  be  called  Y.  Because  connection  B  is  blocked,  E  only  takes 
its  input  from  subtree  Y. 

If  there  are  no  input  lists  in  common  between  X  and  Y,  Y  will 
continue  to  transfer  its  input  list  entries  to  element  E's  input. 
Eventually,  an  entry  (possibly  the  end-of-list  marker)  greater  than  the 
value  in  connection  B  will  occur  at  the  output  of  Y.  This  will  allow  E 
to  process  the  value  in  B,  unblocking  it.  Hence,  the  network  cannot 
deadlock  if  there  is  not  at  least  one  input  list  in  common  between  X  and 
Y. 

If  there  is  an  input  in  common,  the  value  of  its  list  entry  is 
greater  than  or  equal  to  the  value  in  connection  B.  This  is  because  if 
there  were  an  input  to  a  subtree  less  than  the  value  of  its  output,  at 
some  later  time  the  lower  input  would  occur  as  an  output  entry.  But  this 
cannot occur because of the first Lemma above.

Therefore, subtree Y has at least one input which is greater than or
equal  to  the  value  in  connection  B.  Since  Y  is  not  blocked,  eventually 
the  value  at  the  common  input  will  be  the  lowest  input  to  Y.  By  the 
second  Lemma,  shortly  thereafter  this  value  will  be  the  output  of  Y. 
Since  this  value  is  greater  than  or  equal  to  the  value  in  connection  B, 
connection B will be marked invalid by element E. Hence, B is not blocked,
which contradicts the assumption that the tree is deadlocked.

Theorem  An  OR  element  modified  according  to  the  operations  table  in 
Figure  3.6  produces  the  same  valid  items  in  the  same  order  as  one 
constructed  according  to  the  operations  table  in  Figure  1.10. 

Proof  The  only  parts  of  the  new  operations  table  which  must  be  examined 
are  those  which  differ  from  the  original  table.  These  fit  two  different 
categories:  the  input  with  the  lower  value  is  valid  but  the  higher  input 
is invalid, or both inputs are equal, but only one is valid.

In  the  first  case,  a  valid  result  is  produced  and  the  lower  input  is 
marked  as  invalid,  where  previously  no  action  was  taken.  However,  this 
can  produce  an  incorrect  action  only  if  at  the  next  time  both  inputs  are 
valid,  the  input  which  held  the  higher  invalid  input  now  contains  a  valid 
entry  less  than  or  equal  to  the  original  lower  input.  However,  by  the 
above  Lemma,  this  cannot  occur.  Hence,  the  OR  element  functions 
correctly in this case.

In the second case, the element produces a valid result even though
only  one  of  the  two  equal  inputs  is  valid.  Again,  this  is  an  incorrect 
action  only  if  the  input  which  contains  the  invalid  entry  were  to  contain 
a  valid  entry  less  than  or  equal  to  the  current  invalid  entry  at  some 
later  point  in  time.  Since  by  the  Lemma  this  cannot  occur,  the  operation 
is  performed  correctly. 

A similar proof can be used to show that the AND, AND NOT, and PASS
elements function correctly. Since all elements in the network function
correctly, it is clear that the network will always yield the correct results.

3.4 Parsing Expressions for a Fixed Tree Size

If the expression can be contained in the available tree, how the
expression is parsed has no effect on the required processing time.
Disregarding end-of-list effects, the time required to process the expression
is identical for all forms of the expression. It is simply proportional to
the lengths of the input and output lists, since the network is pipelined and
only one entry can be transferred to or from the memory in any one network
cycle. However, if the expression cannot be contained in the available tree
in any form, the problem of reducing the processing time becomes more complex.

In  the  following  discussion,  subexpression  will  mean  that  portion  of  the 
total  expression  which  can  be  processed  directly  by  the  available  tree.  The 
processing  of  a  subexpression  by  the  tree  will  be  termed  a  pass,  with  the 
first  subexpression  processed  during  pass  one.  Figure  3.8  illustrates  a 
possible  scheme  for  numbering  the  passes.  All  the  passes  performed  at  the 
same level of trees in the processing of an expression will be referred to as
a level.

The following discussion assumes that the overlap between lists is
negligible, so that the length of the list produced by an OR operation is the
sum of the two input lists' lengths, while the length of the list produced by
an AND is an extremely small number, and can be regarded as zero.

The  time  required  to  process  an  expression  is  proportional  to  the  sum  of 
the  lengths  of  the  inputs  to  all  the  passes  plus  the  lengths  of  the  outputs 
from  these  passes.  In  particular,  each  entry  of  an  intermediate  list  must  be 
counted  twice  —  once  as  an  output  and  once  as  an  input.  It  is  therefore 
important  to  minimize  the  lengths  of  the  intermediate  lists.  In  addition,  if 
a  list  appears  as  the  input  to  more  than  one  pass,  it  must  be  counted 
separately  for  each  pass  to  determine  the  processing  time.  The  major 
trade-off  is  then  whether  to  fetch  a  list  more  than  once  in  hopes  of  reducing 
the length of the intermediate results.
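
The accounting described above can be written out as a short sketch. It is a simplified model rather than anything taken from the original design: each pass is assumed to apply a single operator, an OR output is taken as the sum of its inputs and an AND output as zero (the negligible-overlap approximation), and the names Pass and processing_time are hypothetical.

```python
from typing import Dict, List, Tuple

Pass = Tuple[str, str, List[str]]  # (output name, "OR" or "AND", input names)


def processing_time(passes: List[Pass], lengths: Dict[str, int]) -> int:
    """Sum input and output lengths over all passes, in list entries."""
    known = dict(lengths)  # do not modify the caller's dictionary
    total = 0
    for output, op, inputs in passes:
        read = sum(known[name] for name in inputs)
        # Negligible-overlap approximation: an OR yields the sum of its
        # inputs, an AND yields (almost) nothing.
        written = read if op == "OR" else 0
        known[output] = written
        total += read + written
    return total


lengths = {"A": 100, "B": 200, "C": 300, "D": 400, "E": 500}
plan = [
    ("T1", "OR", ["A", "B", "C", "D"]),  # pass one fills a four-input tree
    ("R", "OR", ["T1", "E"]),            # pass two reads intermediate T1 back
]
total = processing_time(plan, lengths)
```

In this example the 1000 entries of the intermediate list T1 contribute twice to the total, once when written by pass one and once when read by pass two. A list named in the inputs of two different passes would likewise be added once per pass.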

The  easiest  expressions  to  parse  into  multiple  passes  are  those 
consisting  of  all  AND's  or  all  OR's.  It  is  obvious  that  it  is  not  necessary 
for  an  input  list  to  be  used  more  than  once,  so  the  only  problem  is  to 
minimize  the  lengths  of  the  intermediate  results.  For  an  expression 
consisting  of  all  AND's,  the  length  of  any  intermediate  list  is  negligible, 
and  therefore  the  results  of  any  parsing  scheme  will  require  the  same 
processing  time. 

For an OR expression, however, the length of an intermediate list is
equal to the sum of the lengths of the inputs to the pass creating the
intermediate list. In this case, it is possible to reduce the lengths of the
intermediate lists by including the longest input lists in the highest level
possible. However, each time an input list is used as the input to a pass, it
reduces by one the number of passes available in the next lower level.
Therefore, if too many input lists are used at a higher level, the number of
passes remaining at the lower level may be unable to handle the expression,
necessitating an additional lower level, which will create more intermediate
results. If the length of the input list being included in the higher level
is more than the lengths of the input lists which are used to form the passes
in the new lower level, the input list should be included in the higher level;
if not, it should be included in a lower level.
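
The effect of list placement can be checked with a small calculation. This sketch assumes a two-level evaluation of an all-OR expression on a four-input tree, takes "highest level" to mean the final pass, and uses invented list lengths; the function name two_level_or_time is hypothetical.

```python
from typing import Sequence


def two_level_or_time(lower: Sequence[int], upper: Sequence[int]) -> int:
    """Entries transferred when the `lower` lists are ORed in a first pass
    and the result is ORed with the `upper` lists in the final pass."""
    intermediate = sum(lower)                 # OR output, negligible overlap
    first_pass = sum(lower) + intermediate    # read inputs, write intermediate
    final_inputs = intermediate + sum(upper)  # the intermediate is read back
    final_pass = final_inputs + final_inputs  # read inputs, write the result
    return first_pass + final_pass


lengths = [100, 200, 300, 400, 500, 600, 700]  # seven lists, four-input tree
# Longest lists held back for the final pass.
short_first = two_level_or_time(lengths[:4], lengths[4:])
# Longest lists merged in the first pass instead.
long_first = two_level_or_time(lengths[3:], lengths[:3])
```

Under these assumptions the final pass costs the same either way, and the whole difference comes from the intermediate list, which is written once and read once: 2 x (2200 - 1000) = 2400 entries.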

A  more  interesting  expression  is  the  one  studied  by  Jane  Liu  [16]  for 
conventional  processors.  This  is  a  sum  of  terms  multiplied  by  a  single  term, 
such  as 

(A1 OR A2 OR A3 OR ... OR AN) AND B

An expression of this type requires a tree whose height is equal to 1 + log N,
with the logarithm taken to base 2. This means that at least half the elements
of the tree which produces the expression are PASS elements.

[Figure 3.8 - Multi-Pass Expression Processing]

[Figure 3.9: panels labeled AB input pair, SB input pair, and SS input pair]

Each element at the input end of a tree has two input connections. There
are three classes of inputs which can be applied to this input pair
(illustrated in Figure 3.9):

Class AB, the simplest class, consisting of one of the A inputs and the B
input. In this case, the input element is used as an AND,
effectively distributing B to the selected A input. (Note that while
this is a reasonable way to view the operation, if this B connection
and any other input connection were interchanged, the time required
to process the expression would remain the same, although the
elements would be assigned different functions.)

Class SB, consisting of B and a pass from a lower level which does not
include B in any of its inputs.

Class SS, consisting of two passes from the next lower level, each of
which contains B in its inputs. It is important to note that the
number of A inputs available at the next lower level is identical for
both the SB and SS input pair classes.

The following algorithm reduces the time required to process the
expression given above:

1.  Start  at  the  highest  level  of  the  expression,  setting  B*  =  0. 

2. Find the longest remaining A list. If using it in an AB input pair would
force another level in order to process the expression, go to Step 3.
Otherwise, use it in an AB input pair and set B* = Length(B). If there
are no more A's left, the algorithm is finished. If only one input pair
remains unfilled and more than one A input remains, go to Step 4.
Otherwise, repeat this step.

3. If the longest remaining A causes the creation of a new level, and the sum
of the input lengths to this new level is more than the length of A, go to
Step 4. Otherwise, use it in an AB input pair and set B* = Length(B). If
there are no more A's left, the algorithm is finished. If only one input
pair remains unfilled and there is more than one remaining A input, go to
Step 4. Otherwise, repeat this step.

4. Group the remaining A's into as many subexpressions as there are remaining
input pairs. As much as possible, include the longest remaining A inputs
at the highest levels and in the same subexpressions.

5. For each remaining subexpression, if the sum of the lengths of its inputs
is greater than the length of B plus B*, go to Step 6. Otherwise, use the
subexpression in an SB input pair and set B* = Length(B). If no more
subexpressions exist, the algorithm is finished. Otherwise, repeat this
step.

6. Use this algorithm to process the subexpression AND'ed with B. Go to Step
5.

The algorithms for parsing other forms of expressions are similar, with
the same trade-off between fetching an input more than once and producing
longer intermediate results.


CHAPTER 4 — FUTURE RESEARCH


The previous chapters discussed merge processors and trees constructed
from these processors, in the form which would be used in most cases.
However, certain assumptions were made regarding the configuration of the
tree, the relative bandwidths of the processor and its memory, and the actual
implementation of both processors and networks. In addition, no special
consideration was given to a system which can support more than one user
at a time.

Initial  research  has  been  conducted  in  these  areas.  This  chapter 
presents  the  preliminary  results  of  this  research. 

4.1 Higher Bandwidth Memories

In  the  previous  discussions,  it  was  assumed  that  the  cycle  time  of  the 
memory  was  less  than  or  equal  to  the  time  required  for  a  merge  processor  to 
handle  a  single  entry.  This  assumption  is  valid  for  cycle  times  of  large 
semiconductor memories currently available. Given a merge processor
implemented in commercially available Schottky TTL logic, the memory cycle
time would have to be greater than 80 ns for the assumption to be invalid.
Available  and  projected  logic  families  would  allow  the  processor  to  operate  at 
even  higher  speeds. 

Furthermore, it is the memory which holds the various lists, rather than
the high speed buffer memory, which must be considered. This is because the
lists must be brought from the bulk memory to the high speed memory before the
merge processor can use them. The time required for this is a function of the
bulk memory transfer rate. Current technology dictates that this bulk memory
be some form of disk system, because of the high volume of data. Therefore,
the processor speed is substantially greater than the memory cycle time.
Until faster technologies are available for low cost mass storage of data
(such as CCD shift register memories), the assumption regarding the processor
vs. memory speeds will hold.

It is still important to understand the network behavior when the memory
cycle time is less than the time the merge processor needs to handle an
entry. However, due to the nature of the network, much of its behavior
depends on the data contained in the lists specified by the expression.
Certain characteristics of the processors can be examined, to better assess
the network's behavior. The action of an AND element reduces the size of a
list, while an OR increases it. Assuming negligible list overlap, when
connected in a network an OR element transfers one of its inputs to its
output whenever the output connection is free. On the average, the input
lists are processed at half the rate that the output list of the OR element
is being processed by the OR element's successor. For an AND element, the
normal action is to process one input element every network cycle,
regardless of the action of the successor.

It  can  be  shown  that  the  maximum  memory  parallelism  (expressed  as  a  ratio 
between  the  memory  cycle  time  and  the  processor  speed)  utilized  in  the  steady 
state  by  a  network  of  all  OR's  is  2.  This  is  because  the  network  produces  an 
output  and  requires  only  one  input  during  each  memory  cycle.  However,  for  a 
network  whose  input  stage  consists  of  all  AND's,  the  maximum  parallelism  which 
can  be  utilized  (again  assuming  negligible  list  overlap)  is  equal  to  the 
number of input stage AND processors.

Complicating the analysis is the case where a list is used by more than
one processor as an input. In this case, the network actions depend greatly
on the data contained in the input lists. The best method to analyze the
network behavior in this case is with a simulator.

Figure  4.1  illustrates  the  results  of  a  number  of  simulations  of 
representative  expressions  with  different  amounts  of  memory  parallelism.  For 
each  expression,  the  first  column  is  the  memory  parallelism,  the  second  is  the 
number  of  network  cycles  required  to  completely  process  the  expression,  the 
third  is  the  memory  utilization,  and  the  final  column  is  the  processor 
requirement.  The  memory  utilization  is  simply  the  number  of  memory  cycles 
actually  used  divided  by  the  total  number  of  memory  cycles  available  (which  is 
the  product  of  the  parallelism  and  the  time  required  to  process  the 
expression).  The  processor  requirement  is  the  number  of  processors  needed  by 
the  expression,  which  is  given  next  to  the  expression  in  the  figure,  times  the 
time  required  to  process  the  expression.  The  data  used  for  the  simulation 
consisted  of  lists  of  100  random  numbers,  with  5  numbers  common  to  all  lists, 
and  from  1  to  15  different  numbers  common  between  pairs  of  lists. 
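
The two derived columns follow directly from the definitions above, as the sketch below shows. The function names are hypothetical. The 5 processors and 654 cycles are read from one row of Figure 4.1, while the cycles_used value in the second call is invented for illustration, since the figure does not report the raw count of memory cycles used.

```python
def memory_utilization(cycles_used: int, parallelism: int, network_cycles: int) -> float:
    """Memory cycles actually used divided by the memory cycles available."""
    return cycles_used / (parallelism * network_cycles)


def processor_requirement(processors: int, network_cycles: int) -> int:
    """Processors needed by the expression times the cycles it runs for."""
    return processors * network_cycles


# A 5-processor expression that runs for 654 network cycles.
requirement = processor_requirement(5, 654)

# Hypothetical run: 600 memory cycles used, parallelism 2, 500 network cycles.
utilization = memory_utilization(cycles_used=600, parallelism=2, network_cycles=500)
```

Since the number of memory cycles used is essentially fixed by the data, raising the parallelism lowers the utilization unless the number of network cycles falls in proportion.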

The  differences  in  time  required  to  process  equivalent  expressions  when 
the parallelism is one are due to end-of-processing conditions. A decrease of
one  cycle  in  the  processing  time  occurs  with  each  increase  in  parallelism, 
since  one  more  input  can  be  initially  loaded. 

The  results  show  that  distributing  a  product  of  sums  expression  into  its 
sum  of  products  form  reduces  the  time  required  and  increases  the  memory 
utilization.  However,  it  greatly  increases  the  processor  requirement,  far 
more than the increase in memory utilization. In the case of a fixed tree, it
is therefore advantageous to distribute to increase the number of input stage
AND's, as long as the expression can be contained in the tree. For the last
expression given in Figure 4.1, that of a sum of products whose product terms
have independent inputs, it is clear that ideal memory utilization exists when
the parallelism is less than or equal to the number of input AND's.

EXPRESSION                          MEMORY       CYCLES  MEMORY       PROCESSOR
                                    PARALLELISM          UTILIZATION  REQUIREMENT

(A+B)(C+D)                          1            --      --           --
[3 Processors Required]             2            --      0.59         --
                                    3            358     0.40         --
                                    4            357     0.30         --

(AC+AD+BC+BD)                       1            431     --           --
[7 Processors Required]             2            310     0.69         2190
                                    3            300     0.47         2100
                                    4            299     0.36         2093

(A+B+C)(D+E+F)                      1            654     0.99         3270
[5 Processors Required]             2            524     0.62         2620
                                    3            512     0.42         2560
                                    4            512     0.31         2560
                                    5            511     0.25         2555
                                    6            511     0.21         2555

(AD+AE+AF+BD+BE+BF+CD+CE+CF)        1            656     0.98         11152
[17 Processors Required]            2            441     0.73         7497
                                    3            409     0.53         6953
                                    4            408     0.40         6936
                                    5            408     0.32         6936
                                    6            407     0.26         6919

(A+B)(C+D)(E+F)                     1            610     0.99         3050
[5 Processors Required]             2            403     0.75         2015
                                    3            386     0.52         1930
                                    4            384     0.39         1920

(AB+CD)                             1            417     0.99         1251
[3 Processors Required]             2            216     0.96         648
                                    3            199     0.69         597
                                    4            198     0.52         594

(-- marks an entry that is not legible in the source.)

Figure 4.1 - Simulation Results

Simulations  were  also  made  to  determine  the  effects  of  first  in/first  out 
buffers  between  all  the  elements  in  the  tree.  In  all  but  a  few  extreme  cases, 
there  was  no  benefit  in  using  FIFO's,  and  in  the  cases  where  there  was  benefit 
it  was  negligible.  Different  techniques  for  determining  which  empty  input  to 
fill  were  also  tried,  again  with  the  differences  being  slight. 

4.2 Multi-User Systems

Since  it  is  reasonable  to  assume  that  any  system  requiring  this 
specialized  hardware  is  able  to  support  more  than  one  user  at  a  time,  it  may 
be  convenient  to  use  more  than  one  processor  tree  in  the  system.  This  can  be 
done  so  that  a  time  consuming  request  does  not  block  smaller  requests,  or  to 
take  better  advantage  of  high  memory  parallelism.  In  this  case,  many  medium 
sized  trees  (the  size  dependent  on  the  normal  expression  lengths  for  the 
system) can be used. Again, the expression is distributed to fill the tree,
but this is less important than previously, since the effective parallelism is
equal to the memory parallelism divided by the number of trees in use at a
given time. To avoid a single tree monopolizing the memory, the following
memory priority scheme can be used if the parallelism is greater than twice
the number of trees. First, store all outputs from the trees; then fill one
input of each tree; if more memory cycles are available, fill any remaining
inputs. If the parallelism is less than twice the number of trees, a similar
scheme can be used, but with a commutator to decide which tree should be given
the first try at using the memory.
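
One way to picture this priority scheme is the sketch below, which hands out the memory cycles available in a single network cycle. It is an illustration under stated assumptions only: the function name, its arguments, and the modeling of the commutator as a rotating starting index (first_tree) are all hypothetical and do not come from the original design.

```python
from typing import List, Tuple


def allocate_memory_cycles(pending_outputs: List[int],
                           empty_inputs: List[int],
                           parallelism: int,
                           first_tree: int = 0) -> List[Tuple[str, int]]:
    """Hand out `parallelism` memory cycles for one network cycle.

    pending_outputs[t] is 1 if tree t has an output waiting to be stored;
    empty_inputs[t] is the number of inputs of tree t waiting to be filled.
    first_tree plays the role of the commutator: the tree served first.
    """
    n = len(pending_outputs)
    order = [(first_tree + k) % n for k in range(n)]
    remaining = list(empty_inputs)
    grants: List[Tuple[str, int]] = []

    def grant(kind: str, tree: int) -> bool:
        """Record one memory cycle if any are still available."""
        if len(grants) < parallelism:
            grants.append((kind, tree))
            return True
        return False

    for t in order:  # first, store all outputs from the trees
        if pending_outputs[t]:
            grant("store", t)
    for t in order:  # then fill one input of each tree
        if remaining[t] and grant("fetch", t):
            remaining[t] -= 1
    for t in order:  # finally, fill any remaining inputs
        while remaining[t] and grant("fetch", t):
            remaining[t] -= 1
    return grants


plenty = allocate_memory_cycles([1, 0], [2, 1], parallelism=5)
scarce = allocate_memory_cycles([1, 0], [2, 1], parallelism=2, first_tree=1)
```

With ample parallelism every request is served in priority order. When cycles are scarce, advancing first_tree by one on each network cycle would give each tree the first try in turn.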

A  special  use  for  multiple  processor  trees  is  presented  by  the  EXPLODE 
command,  as  discussed  in  Chapter  1.  This  command  forms  the  OR  of  a  large 
number  of  lists.  It  is  not  reasonable  to  have  every  tree  large  enough  to 
handle  EXPLODE's,  which  occur  frequently  enough  to  be  considered  but  not 
frequently  enough  to  warrant  making  every  user's  tree  large  enough  to  handle 
them.

Two  possible  solutions  exist.  Either  one  or  more  large  tree  networks  can 
be  constructed  to  be  used  especially  for  EXPLODE's,  or  a  Batcher  merge 
configuration  similar  to  that  proposed  by  Stellhorn  can  be  used.  In  the 
latter  case,  it  may  not  be  necessary  to  construct  as  complex  a  coordination 
network.  If  the  overlap  in  the  list  entries  is  slight,  then  very  few  entries 
must  be  eliminated.  In  this  case,  it  may  be  easier  and  faster  to  leave  them 
in  the  list,  with  a  special  flag  indicating  that  they  are  not  to  be 
considered.  Since  it  is  likely  that  the  results  of  the  EXPLODE  will  be  used 
as  an  input  to  some  later  expression,  the  merge  processing  tree  can  be 
modified  to  ignore  any  list  entry  with  this  special  bit  set,  much  as  it 
presently  ignores  entries  which  do  not  have  the  proper  context  bits  set. 

The  primary  benefit  of  the  Stellhorn/Batcher  network  is  that  it  can  be 
expanded without limit to match the bandwidth of the memory. (However, it
should be remembered that the cost of the network is order N log N.) The
expansion of the merge processing tree discussed in Chapter 3 is limited by
the length of the list element (which determines how many input bits can be
processed in parallel), and the complexity of the expression (which determines
the maximum practical tree size). But the Stellhorn/Batcher network has the
same  problem  with  intermediate  results  as  the  single  merge  processor  discussed 
in  Chapter  2.  Although  it  is  possible  to  connect  Stellhorn's  merge  processor 
to  form  a  tree,  the  amount  of  hardware  which  would  be  necessary  for  a  tree  to 
handle  an  EXPLODE  would  make  it  currently  impractical. 

Although it depends on the size of the list entry, the amount of
parallelism in the Batcher network, and the number of terms in the EXPLODE, it
appears that for EXPLODE's on the order of 64 terms the time required
using the tree network is less than that using a Stellhorn/Batcher system.
Slightly more gates are required for the tree network, but the product of
gates times time is less for the tree network.


CONCLUSIONS


It has been observed that much of the time required to process a request
on a large scale inverted file information retrieval system is used to
transfer and merge lists of pointers to items in the text files. This problem
is ill suited for conventional digital computers, which are organized
primarily for numeric operations. Less than ten percent of the memory cycles
used by the merge subroutines directly process the data.

It  is  possible  to  construct  specialized  merge  processors,  which  can 
operate  at  speeds  faster  than  the  memories  generally  available  on  conventional 
processors,  thus  increasing  the  processing  speed  by  a  factor  greater  than  ten. 
In  fact,  it  is  possible  for  a  processor  of  this  type  to  operate  ten  times 
faster  than  the  memory  on  the  IBM  System/360  Model  75,  so  that  with 
sufficiently  high  speed  memory,  the  processing  time  for  a  merge  can  be  reduced 
by  a  factor  of  over  one  hundred. 

These simple processors can also be connected together to form networks,
primarily binary trees. This further reduces the time required to process an
expression, since the number of memory cycles required to store and later
refetch intermediate results is reduced. In fact, if the tree is large enough
to accommodate the complete expression, no intermediate results will be
produced. For an expression with N terms, this can result in an increase in
the effective bandwidth of log N. An additional increase in effective memory
bandwidth occurs if a list is used more than once in an expression. In this
case, it can be fetched only once and distributed to the various network
inputs, while conventional approaches would require it to be fetched each
time it was used.

As long as the expression can be contained in the tree, any scheme used
for parsing it will result in the same processing time. If it is larger than
the available tree, it can be divided into a number of passes through the tree
so that the time required is reduced. This technique was illustrated for a
number of different forms of expressions.

The  total  speed  increase  is  highly  dependent  on  the  structure  of  the 
input  lists  and  the  form  of  the  expression.  The  factors  mentioned  above 
combine multiplicatively, so even if the memory used has the same bandwidth as
for  the  conventional  computer,  it  is  possible  to  achieve  increases  of  greater 
than  ten  for  a  simple  two  term  expression.  If  the  user  specifies  an  EXPLODE 
which  forms  the  OR  of  64  different  lists,  it  is  possible  that  the  speed 
increase  will  be  from  60  to  100.  The  use  of  higher  speed  memory,  to  better 
match  the  speed  of  the  merge  processors,  increases  these  speeds  by  a  factor  of 
10.  It  is  therefore  possible  to  process  large  EXPLODE's  up  to  1000  times 
faster  than  a  conventional  general  purpose  digital  computer. 

The  introduction  of  a  specialized  list  merging  system  of  this  type  into  a 
large  scale  inverted  file  information  retrieval  system  either  can  drastically 
improve  the  response  time  for  the  existing  users,  or  can  allow  substantially 
more users to utilize the system at the same time.